Load Packages

library(tidyverse)
library(dplyr)
library(ggplot2)
library(usmap)
library(ggthemes)
<<<<<<< HEAD

Load Datasets

=======

Datasets and Descriptions

>>>>>>> 9ac508d97c46405daa9b11122e0c24b3166c1d75

MATT PUT IN DATA INFO HERE (when corona cases stopped tracking, what is pov dataset)

covidData <- read.csv("coronacounties.csv")
povertyData <- read.csv("PovertyEstimates.csv")
populationData <- usmap::countypop
populationData$fips <- as.integer(populationData$fips)
<<<<<<< HEAD

Exploratory Data Analysis

=======

Section 1: Introduction

Exploratory Data Analysis

>>>>>>> 9ac508d97c46405daa9b11122e0c24b3166c1d75

First, let’s start by glimpsing the 2 datasets.

glimpse(covidData)
## Observations: 61,971
## Variables: 6
## $ date   <fct> 2020-01-21, 2020-01-22, 2020-01-23, 2020-01-24, 2020-01-24, 20…
## $ county <fct> Snohomish, Snohomish, Snohomish, Cook, Snohomish, Orange, Cook…
## $ state  <fct> Washington, Washington, Washington, Illinois, Washington, Cali…
## $ fips   <int> 53061, 53061, 53061, 17031, 53061, 6059, 17031, 53061, 4013, 6…
## $ cases  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ deaths <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
covidData %>%
  group_by(fips) %>%
  count()
## # A tibble: 2,709 x 2
## # Groups:   fips [2,709]
##     fips     n
##    <int> <int>
##  1  1001    23
##  2  1003    33
##  3  1005    13
##  4  1007    17
##  5  1009    22
##  6  1011    21
##  7  1013    22
##  8  1015    29
##  9  1017    28
## 10  1019    22
## # … with 2,699 more rows

At first glimpse, the COVID-19 dataset has 61,971 observations and 6 variables. However, after grouping by FIPS code, we see that the dataset contains data on 2,709 counties, with repeated observations representing cases on different dates.

glimpse(povertyData)
## Observations: 3,193
## Variables: 34
## $ FIPStxt                         <int> 0, 1000, 1001, 1003, 1005, 1007, 1009…
## $ Stabr                           <fct> US, AL, AL, AL, AL, AL, AL, AL, AL, A…
## $ Area_name                       <fct> United States, Alabama, Autauga Count…
## $ Rural.urban_Continuum_Code_2003 <int> NA, NA, 2, 4, 6, 1, 1, 6, 6, 3, 6, 8,…
## $ Urban_Influence_Code_2003       <int> NA, NA, 2, 5, 6, 1, 1, 6, 6, 2, 5, 6,…
## $ Rural.urban_Continuum_Code_2013 <int> NA, NA, 2, 3, 6, 1, 1, 6, 6, 3, 6, 6,…
## $ Urban_Influence_Code_2013       <int> NA, NA, 2, 2, 6, 1, 1, 6, 6, 2, 5, 6,…
## $ POVALL_2018                     <fct> "41,852,315", "801,758", "7,587", "21…
## $ CI90LBAll_2018                  <fct> "41,619,366", "785,668", "6,334", "17…
## $ CI90UBALL_2018                  <fct> "42,085,264", "817,848", "8,840", "24…
## $ PCTPOVALL_2018                  <dbl> 13.1, 16.8, 13.8, 9.8, 30.9, 21.8, 13…
## $ CI90LBALLP_2018                 <dbl> 13.0, 16.5, 11.5, 8.1, 25.8, 17.1, 10…
## $ CI90UBALLP_2018                 <dbl> 13.2, 17.1, 16.1, 11.5, 36.0, 26.5, 1…
## $ POV017_2018                     <fct> "12,997,532", "255,613", "2,509", "6,…
## $ CI90LB017_2018                  <fct> "12,873,127", "247,744", "1,965", "4,…
## $ CI90UB017_2018                  <fct> "13,121,937", "263,482", "3,053", "8,…
## $ PCTPOV017_2018                  <dbl> 18.0, 23.9, 19.3, 13.9, 43.9, 27.8, 1…
## $ CI90LB017P_2018                 <dbl> 17.8, 23.2, 15.1, 10.2, 35.0, 20.7, 1…
## $ CI90UB017P_2018                 <dbl> 18.2, 24.6, 23.5, 17.6, 52.8, 34.9, 2…
## $ POV517_2018                     <fct> "8,930,152", "178,175", "1,891", "4,5…
## $ CI90LB517_2018                  <fct> "8,834,521", "171,349", "1,469", "3,2…
## $ CI90UB517_2018                  <fct> "9,025,783", "185,001", "2,313", "5,8…
## $ PCTPOV517_2018                  <dbl> 17.0, 22.8, 19.5, 13.1, 36.7, 26.3, 1…
## $ CI90LB517P_2018                 <dbl> 16.8, 21.9, 15.1, 9.3, 27.5, 19.0, 10…
## $ CI90UB517P_2018                 <dbl> 17.2, 23.7, 23.9, 16.9, 45.9, 33.6, 2…
## $ MEDHHINC_2018                   <fct> "61,937", "49,881", "59,338", "57,588…
## $ CI90LBINC_2018                  <fct> "61,843", "49,123", "53,628", "54,437…
## $ CI90UBINC_2018                  <fct> "62,031", "50,639", "65,048", "60,739…
## $ POV04_2018                      <fct> "3,758,704", "73,915", "", "", "", ""…
## $ CI90LB04_2018                   <fct> "3,714,862", "69,990", "", "", "", ""…
## $ CI90UB04_2018                   <fct> "3,802,546", "77,840", "", "", "", ""…
## $ PCTPOV04_2018                   <dbl> 19.5, 26.0, NA, NA, NA, NA, NA, NA, N…
## $ CI90LB04P_2018                  <dbl> 19.3, 24.6, NA, NA, NA, NA, NA, NA, N…
## $ CI90UB04P_2018                  <dbl> 19.7, 27.4, NA, NA, NA, NA, NA, NA, N…

The Poverty dataset has 3,193 observations (counties, states, and the US) and 34 variables.

Both the COVID-19 dataset and the Poverty dataset have a variation of the variable fips, which is the Federal Information Processing Standard [https://en.wikipedia.org/wiki/FIPS_county_code] that allows counties to be uniquely identified.

Thus, the FIPS variable is a good variable by which to join these 2 datasets:

data <- merge(covidData, povertyData, by.x = 'fips', by.y = 'FIPStxt')
glimpse(data)
## Observations: 61,164
## Variables: 39
## $ fips                            <int> 1001, 1001, 1001, 1001, 1001, 1001, 1…
## $ date                            <fct> 2020-03-27, 2020-04-07, 2020-03-25, 2…
## $ county                          <fct> Autauga, Autauga, Autauga, Autauga, A…
## $ state                           <fct> Alabama, Alabama, Alabama, Alabama, A…
## $ cases                           <int> 6, 12, 4, 12, 6, 19, 10, 25, 23, 7, 1…
## $ deaths                          <int> 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0…
## $ Stabr                           <fct> AL, AL, AL, AL, AL, AL, AL, AL, AL, A…
## $ Area_name                       <fct> Autauga County, Autauga County, Autau…
## $ Rural.urban_Continuum_Code_2003 <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ Urban_Influence_Code_2003       <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ Rural.urban_Continuum_Code_2013 <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ Urban_Influence_Code_2013       <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ POVALL_2018                     <fct> "7,587", "7,587", "7,587", "7,587", "…
## $ CI90LBAll_2018                  <fct> "6,334", "6,334", "6,334", "6,334", "…
## $ CI90UBALL_2018                  <fct> "8,840", "8,840", "8,840", "8,840", "…
## $ PCTPOVALL_2018                  <dbl> 13.8, 13.8, 13.8, 13.8, 13.8, 13.8, 1…
## $ CI90LBALLP_2018                 <dbl> 11.5, 11.5, 11.5, 11.5, 11.5, 11.5, 1…
## $ CI90UBALLP_2018                 <dbl> 16.1, 16.1, 16.1, 16.1, 16.1, 16.1, 1…
## $ POV017_2018                     <fct> "2,509", "2,509", "2,509", "2,509", "…
## $ CI90LB017_2018                  <fct> "1,965", "1,965", "1,965", "1,965", "…
## $ CI90UB017_2018                  <fct> "3,053", "3,053", "3,053", "3,053", "…
## $ PCTPOV017_2018                  <dbl> 19.3, 19.3, 19.3, 19.3, 19.3, 19.3, 1…
## $ CI90LB017P_2018                 <dbl> 15.1, 15.1, 15.1, 15.1, 15.1, 15.1, 1…
## $ CI90UB017P_2018                 <dbl> 23.5, 23.5, 23.5, 23.5, 23.5, 23.5, 2…
## $ POV517_2018                     <fct> "1,891", "1,891", "1,891", "1,891", "…
## $ CI90LB517_2018                  <fct> "1,469", "1,469", "1,469", "1,469", "…
## $ CI90UB517_2018                  <fct> "2,313", "2,313", "2,313", "2,313", "…
## $ PCTPOV517_2018                  <dbl> 19.5, 19.5, 19.5, 19.5, 19.5, 19.5, 1…
## $ CI90LB517P_2018                 <dbl> 15.1, 15.1, 15.1, 15.1, 15.1, 15.1, 1…
## $ CI90UB517P_2018                 <dbl> 23.9, 23.9, 23.9, 23.9, 23.9, 23.9, 2…
## $ MEDHHINC_2018                   <fct> "59,338", "59,338", "59,338", "59,338…
## $ CI90LBINC_2018                  <fct> "53,628", "53,628", "53,628", "53,628…
## $ CI90UBINC_2018                  <fct> "65,048", "65,048", "65,048", "65,048…
## $ POV04_2018                      <fct> , , , , , , , , , , , , , , , , , , ,…
## $ CI90LB04_2018                   <fct> , , , , , , , , , , , , , , , , , , ,…
## $ CI90UB04_2018                   <fct> , , , , , , , , , , , , , , , , , , ,…
## $ PCTPOV04_2018                   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ CI90LB04P_2018                  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ CI90UB04P_2018                  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
state_cases <- data %>%
  filter(date == "2020-04-15") %>%
  group_by(state) %>%
  summarise(total_cases = sum(cases))

cases_map <- plot_usmap(data = state_cases, values = "total_cases", labels = TRUE, label_color = "gray", color = "red")

cases_map <- cases_map + 
              scale_fill_gradient(low = "white", high = "#CB454A", name = "Cases") +
              labs(title = "Cases by State on April 15, 2020") + 
              labs(subtitle = "Most cases are concentrated in New York State.") + 
              theme(legend.position = "right")

cases_map
<<<<<<< HEAD

Need to add histograms for the response variable we choose, the predictors we choose, and plots for the response v. predictors relationships

=======

Need to add histograms for the response variable we choose, the predictors we choose, and plots for the response v. predictors relationships (We can do this in the section following - response variable description and predictor descriptions)

>>>>>>> 9ac508d97c46405daa9b11122e0c24b3166c1d75
county_cases <- data %>%
  filter(date == "2020-04-15")

county_cases <- merge(x = populationData, y = county_cases, by = "fips", all = TRUE)

<<<<<<< HEAD
county_cases$cases[is.na(county_cases$cases)] <- 0

county_cases$log_cases <- log(county_cases$cases)
county_cases$log_cases[is.na(county_cases$log_cases)] <- 0

county_cases_map <- plot_usmap("counties", data = county_cases, values = "log_cases", color = "red")

county_cases_map <- county_cases_map + 
              scale_fill_gradient(low = "white", high = "#CB454A", name = "Log Cases") +
=======
county_cases$log_cases <- log(county_cases$cases)
county_cases$cases[is.na(county_cases$cases)] <- 0
county_cases$log_cases[is.na(county_cases$log_cases)] <- 0

county_cases_map <- plot_usmap("counties", data = county_cases, values = "cases", color = "red", size=0.01)

county_cases_map <- county_cases_map + 
              scale_fill_gradient(low = "white", high = "#CB454A", name = " Cases") +
>>>>>>> 9ac508d97c46405daa9b11122e0c24b3166c1d75
              labs(title = "Cases by County on April 15, 2020") + 
              labs(subtitle = "Most cases are concentrated in New York State.") + 
              theme(legend.position = "right")

county_cases_map
<<<<<<< HEAD

=======

county_cases_map <- plot_usmap("counties", data = county_cases, values = "log_cases", color = "red", size=0.01)

county_cases_map <- county_cases_map + 
              scale_fill_gradient(low = "white", high = "#CB454A", name = "Log Cases") +
              labs(title = "Log Cases by County on April 15, 2020") + 
              labs(subtitle = "Most cases are concentrated in New York State.") + 
              theme(legend.position = "right")

county_cases_map

plot_poverty_data <- data <- merge(populationData, povertyData, by.x = 'fips', by.y = 'FIPStxt')

county_cases_map <- plot_usmap("counties", data = plot_poverty_data, values = "PCTPOVALL_2018", color = "red", size=0.01)

county_cases_map <- county_cases_map + 
              scale_fill_gradient(low = "white", high = "#CB454A", name = "PCTPOVALL_2018") +
              labs(title = "Poverty Level by County (2018 values)") + 
              theme(legend.position = "right")

county_cases_map

We should look at histograms of both cases and log(cases) to see if the apparent more even spread of log(cases) is normal compared to regular cases

county_cases %>%
  ggplot(mapping = aes(x = cases)) +
  geom_histogram(binwidth = 600) + 
  labs(title = "Histogram of Number of Cases per County", x = "Number of Cases",
       y = "# Counties with X Cases")

As we predicted, we see an incredible amount of right skewness in the distribution of cases per county.

county_cases %>%
  ggplot(mapping = aes(x = log(cases))) +
  geom_histogram(binwidth = 0.5) + 
  labs(title = "Histogram of Number of log of # Cases per County", x = "Number of Cases",
       y = "# Counties with log(X) Cases")
## Warning: Removed 450 rows containing non-finite values (stat_bin).

This historam of the log(cases) per county shows much more evidence of normailty than the graph of just cases. It would be a much better response variable, however we should analyze some other options as well.

Another possible option to analyze as a response would be the number of cases per capita of a county. This may work better as it would correlate significantly less with population as we have “accounted” for this already in the response.

county_cases <- county_cases %>%
  mutate(casesPC = cases/pop_2015)

casesPC_map <- plot_usmap("counties", data = county_cases, values = "casesPC", color = "red", size=0.01)

casesPC_map <- casesPC_map + 
              scale_fill_gradient(low = "white", high = "limegreen", name = "Cases Per Capita") +
              labs(title = "Cases per Capita (by county)") + 
              theme(legend.position = "right")

casesPC_map

county_cases %>%
  ggplot(mapping = aes(x = casesPC)) +
  geom_histogram(binwidth = 0.0001) + 
  labs(title = "Histogram of Number of  Cases/Capita per County", x = "Number of Cases/Capita",
       y = "# Counties with X Cases per Capita")

Similarly to the map of cases per county, we see a large discrepancy between counties and may benefit from mapping the log of the cases per capita in order to better normalize the distribution. The histogram confirms this theory as it is majorly right skewed.

county_cases$log_casesPC <- log(county_cases$casesPC)
county_cases$casesPC[is.na(county_cases$casesPC)] <- 0
county_cases$log_casesPC[is.na(county_cases$log_casesPC)] <- 0

logcasesPC_map <- plot_usmap("counties", data = county_cases, values = "log_casesPC", color = "red", size=0.01)

logcasesPC_map <- logcasesPC_map + 
              scale_fill_gradient(low = "white", high = "limegreen", name = "Log of Cases per Capita") +
              labs(title = "Log of Cases per Capita (by county)") + 
              theme(legend.position = "right")

logcasesPC_map

county_cases %>%
  ggplot(mapping = aes(x = log_casesPC)) +
  geom_histogram(binwidth = 0.3) + 
  labs(title = "Histogram of Number of log(Cases/Capita per County)", x = "Number of log(Cases/Capita)",
       y = "# Counties with log(X) Cases per Capita")
## Warning: Removed 450 rows containing non-finite values (stat_bin).

We see a much more even distribution using the log of cases per capita. It is important to note that many counties appear as grey in this map; these counties have zero cases and thus the log of them returns the value of negative infinity causing this. We can remove or standardize them in our later analysis. The histogram of the log of cases per capita is very normal which would be perfect for analysis.

Description of the Response Variable

Descriptions of the Regressors

Section 2: Regression Analysis

Generating Our Model

Testing Our Assumptions

Multicollinearity

Interaction Effects

Influential Points

Interpreting The Coefficients

Section 3: Conclusion

General Commentary

Discussion of Results

Limitations and Ideas for Further Research

Exploratory Data Analysis

>>>>>>> 9ac508d97c46405daa9b11122e0c24b3166c1d75

Regression

Ideas for regression - Linear regression: predict from multiple factors (FIP code, poverty rate, race breakdown, etc.) their number of cases or deaths - Logistic regression: predict if factors contribute to a death or not (don’t think we have specific data on specific cases so this could be challenging)

Assumptions for regression - Independence - Linearity - Normality - Constant Variance

This website has specific graphs that would be helpful to replicate, but don’t know where to get the data on an aggregate level: [https://www.apmresearchlab.org/covid/deaths-by-race#black]